myHadoop - Hadoop-on-Demand on Traditional HPC Resources
Authors
Abstract
Traditional High Performance Computing (HPC) resources, such as those available on the TeraGrid, support batch job submissions using Distributed Resource Management Systems (DRMS) like TORQUE or the Sun Grid Engine (SGE). For large-scale data-intensive computing, programming paradigms such as MapReduce are becoming popular. A growing number of codes in scientific domains such as Bioinformatics and Geosciences are being written using open source MapReduce tools such as Apache Hadoop. It has proven challenging for Hadoop to co-exist with existing HPC resource management systems, since both provide their own job submission and management, and each system is designed to have complete control over its resources. Furthermore, Hadoop uses a shared-nothing architecture, whereas most HPC resources employ a shared-disk setup. In this paper, we describe myHadoop, a framework for configuring Hadoop on-demand on traditional HPC resources using standard batch scheduling systems. With myHadoop, users can develop and run Hadoop codes on HPC resources without requiring root-level privileges. Here, we describe the architecture of myHadoop, and evaluate its performance for a few sample scientific use-case scenarios. myHadoop is open source, and available for download on SourceForge.
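To make the on-demand idea concrete, the sketch below shows what a TORQUE batch job wrapping a transient Hadoop cluster might look like. This is an illustrative outline, not myHadoop's exact interface: the helper script names (`myhadoop-configure.sh`, `myhadoop-cleanup.sh`), paths, and flags are assumptions standing in for whatever a given site provides. The pattern is the point: configure a per-job Hadoop cluster from the scheduler's node allocation, run the MapReduce job, then tear everything down before the allocation ends.

```shell
#!/bin/bash
#PBS -N hadoop-wordcount
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00

# Illustrative paths -- site-specific; MY_HADOOP_HOME and the helper
# script names below are assumptions, not a fixed API.
export MY_HADOOP_HOME=$HOME/myhadoop
export HADOOP_CONF_DIR=$HOME/hadoop-conf.$PBS_JOBID

# 1. Generate per-job Hadoop configuration from the scheduler's node
#    list, so the transient cluster uses only the nodes this job holds.
$MY_HADOOP_HOME/bin/myhadoop-configure.sh -n 4 -c $HADOOP_CONF_DIR

# 2. Start the HDFS and MapReduce daemons on the allocated nodes.
$HADOOP_HOME/bin/start-all.sh

# 3. Stage data into HDFS, run the job, stage results back out.
hadoop --config $HADOOP_CONF_DIR dfs -copyFromLocal input/ input/
hadoop --config $HADOOP_CONF_DIR jar wordcount.jar WordCount input output
hadoop --config $HADOOP_CONF_DIR dfs -copyToLocal output/ results/

# 4. Tear the cluster down so the allocation is returned clean.
$HADOOP_HOME/bin/stop-all.sh
$MY_HADOOP_HOME/bin/myhadoop-cleanup.sh
```

Because the Hadoop daemons live and die inside a single batch job, the DRMS remains the sole authority over the nodes, which is how the framework avoids the dual-resource-manager conflict described above.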
Similar resources
Pilot-Abstraction: A Valid Abstraction for Data-Intensive Applications on HPC, Hadoop and Cloud Infrastructures?
HPC environments have traditionally been designed to meet the compute demand of scientific applications, and data has only been a second-order concern. With science moving toward data-driven discoveries relying more and more on correlations in data to form scientific hypotheses, the limitations of existing HPC approaches become apparent: architectural paradigms such as the separation of storage ...
Big Data at HPC Wales
This paper describes an automated approach to handling Big Data workloads on HPC systems. We describe a solution that dynamically creates a unified cluster based on YARN in an HPC environment, without the need to configure and allocate a dedicated Hadoop cluster. The end user can choose to write the solution in any combination of supported frameworks, a solution that scales seamlessly from a fe...
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia
In the last decade, we witnessed an increasing interest in High Performance Computing (HPC) infrastructures, which play an important role in both academic and industrial research projects. At the same time, due to the increasing amount of available data, we also witnessed the introduction of new frameworks and applications based on the MapReduce paradigm (e.g., Hadoop). Traditional HPC systems ...
Scalable Inverted Indexing on NoSQL Table Storage
The development of data-intensive problems in recent years has brought new requirements and challenges to storage and computing infrastructures. Researchers are not only doing batch loading and processing of large-scale data, but also demanding the capabilities of incremental updates and interactive analysis. Therefore, extending existing storage systems to handle these new requirements beco...
Cloud Solutions for High Performance Computing: Oxymoron or Realm?
In recent years, a strong interest in cloud computing has arisen within the HPC (High Performance Computing) community. There are many apparent benefits of doing HPC in a cloud, the most important being better utilization of computational resources, efficient charge-back of used resources and applications, and on-demand, dynamic reallocation of computational resources bet...
Journal:
Volume, issue:
Pages: -
Publication date: 2004